
Conversation

@iRakson (Contributor) commented Mar 6, 2020

What changes were proposed in this pull request?

A new function json_object_keys is proposed in this PR. The function returns all the keys of the outermost JSON object and takes a JSON object as its argument. It behaves as follows (see the sketch after this list):

  • If an invalid JSON expression is given, NULL is returned.
  • If an empty string or a JSON array is given, NULL is returned.
  • If a valid JSON object is given, all the keys of the outermost object are returned as an array.
  • For an empty JSON object, an empty array is returned.
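
A minimal spark-shell sketch of these cases (hypothetical inputs; the expected outputs in the comments follow the rules above):

```scala
spark.sql("""SELECT json_object_keys('{"key": "value"}')""").show()  // [key]
spark.sql("""SELECT json_object_keys('{}')""").show()                // []
spark.sql("""SELECT json_object_keys('[1, 2, 3]')""").show()         // null (JSON array, not an object)
spark.sql("""SELECT json_object_keys('')""").show()                  // null (empty string)
spark.sql("""SELECT json_object_keys('{"a": 1')""").show()           // null (invalid JSON)
```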

We can also get JSON object keys using map_keys+from_json (an equivalent query is sketched below), but json_object_keys is more efficient.
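
For illustration, a sketch of the two equivalent formulations; the from_json schema here is an assumption chosen for the example:

```scala
spark.sql("""SELECT json_object_keys('{"a": 1, "b": 2}')""").show()  // no schema needed
spark.sql("""SELECT map_keys(from_json('{"a": 1, "b": 2}', 'map<string,string>'))""").show()
```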

Performance result for json_object = {"a":[1,2,3,4,5], "b":[2,4,5,12333321]}

Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
JSON functions:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
json_object_keys                                  11666          12361         673          0.9        1166.6       1.0X
from_json+map_keys                                15309          15973         701          0.7        1530.9       0.8X

Why are the changes needed?

This function will help users directly extract the keys from a JSON string, and it is fairly intuitive as well. It also extends Spark SQL's functionality for JSON strings.

Some of the most popular DBMSs support this function.

  • PostgreSQL
  • MySQL
  • MariaDB

Does this PR introduce any user-facing change?

Yes. Users can now extract the keys of JSON objects using this function.

How was this patch tested?

UTs added.

@iRakson (Contributor, Author) commented Mar 6, 2020

@MaxGekk (Member) left a comment

@iRakson Could you show a use case of the function?

@MaxGekk (Member) commented Mar 6, 2020

We can get the same result using existing functions:

scala> val df = Seq("""{"a":1, "b":2}""").toDF("json")
df: org.apache.spark.sql.DataFrame = [json: string]

scala> df.select(map_keys(from_json($"json", MapType(StringType, StringType)))).show
+-----------------+
|map_keys(entries)|
+-----------------+
|           [a, b]|
+-----------------+

What's the benefit of having the function in public API?

@iRakson (Contributor, Author) commented Mar 6, 2020

What's the benefit of having the function in public API?

  • json_object_keys gives better performance than map_keys+from_json.
  • Users are more familiar with json_object_keys, as it is supported by some popular DBMSs, so they will find it easier to use.

@iRakson (Contributor, Author) commented Mar 10, 2020

@HyukjinKwon @cloud-fan @maropu
Please take a look at this as well.

@dongjoon-hyun (Member)

ok to test

@dongjoon-hyun (Member)

Do you know how much faster it is?

json_object_keys gives better performance than map_keys+from_json.

@SparkQA commented Mar 11, 2020

Test build #119637 has finished for PR 27836 at commit 17caff4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class JsonObjectKeys(child: Expression) extends UnaryExpression with CodegenFallback

@SparkQA commented Mar 11, 2020

Test build #119648 has finished for PR 27836 at commit 7a74b30.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@iRakson (Contributor, Author) commented Mar 17, 2020

retest this please

@HyukjinKwon (Member)

retest this please

@HyukjinKwon (Member)

@MaxGekk does it look good to you?

@maropu (Member) commented Mar 19, 2020

Do you know how much faster it is?

json_object_keys gives better performance than map_keys+from_json.

Has the @dongjoon-hyun comment above already been addressed? If this function has performance benefits, I think it's better to show the performance numbers in the PR description.

@iRakson (Contributor, Author) commented Mar 19, 2020

Do you know how much faster it is?

json_object_keys gives better performance than map_keys+from_json.

Has the @dongjoon-hyun comment above already been addressed? If this function has performance benefits, I think it's better to show the performance numbers in the PR description.

I have updated the timings for map_keys+from_json and json_object_keys in the PR description. I will also try to add benchmarks for all the new JSON functions.
@maropu @HyukjinKwon

@SparkQA commented Mar 19, 2020

Test build #120032 has finished for PR 27836 at commit 7a74b30.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@iRakson (Contributor, Author) commented Mar 21, 2020

@MaxGekk does it look good to you?

@MaxGekk Kindly review this.

var arrayBufferOfKeys = ArrayBuffer.empty[UTF8String]
// this handles `NULL` case
if (parser.nextToken() == null) {
throw new AnalysisException(s"$prettyName expect a JSON object but nothing is provided.")
Member

Is the exception thrown at the analysis phase? I think AnalysisException should be replaced by RuntimeException.

@MaxGekk (Member) commented Mar 21, 2020

The function works for literals, but I got an NPE on a column with JSON objects:

  test("json_object_keys") {
    spark.sql("""select json_object_keys('{"a": 1, "b": 2, "c": 3}')""").show(false)
    val df = Seq(
      """{"a":1}""",
      """{"a":1, "b":2}""",
      """{"a": 1, "b": 2, "c": 3}""").toDF("json")
    df.show(false)
    val dfKeys = df.selectExpr("json_object_keys(json)")
    dfKeys.show(false)
  }
+------------------------------------------+
|json_object_keys({"a": 1, "b": 2, "c": 3})|
+------------------------------------------+
|[a, b, c]                                 |
+------------------------------------------+

+------------------------+
|json                    |
+------------------------+
|{"a":1}                 |
|{"a":1, "b":2}          |
|{"a": 1, "b": 2, "c": 3}|
+------------------------+

06:28:11.089 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.InternalRow$.$anonfun$getAccessor$16(InternalRow.scala:152)
...
Caused by: java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.InternalRow$.$anonfun$getAccessor$16(InternalRow.scala:152)
	at org.apache.spark.sql.catalyst.InternalRow$.$anonfun$getAccessor$16$adapted(InternalRow.scala:151)
	at org.apache.spark.sql.catalyst.expressions.BoundReference.eval(BoundAttribute.scala:41)
	at org.apache.spark.sql.catalyst.expressions.JsonObjectKeys.json$lzycompute(jsonExpressions.scala:812)
	at org.apache.spark.sql.catalyst.expressions.JsonObjectKeys.json(jsonExpressions.scala:812)

@iRakson (Contributor, Author) commented Mar 22, 2020

The function works for literals, but I got an NPE on a column with JSON objects: […]

I fixed this issue and added a new test case covering it.
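
For context, the stack trace in the comment above points at a class-level lazy val (JsonObjectKeys.json$lzycompute), which evaluates the child expression once, without the per-row input, and then caches the result. A self-contained Scala sketch of the caching half of that pitfall (a generic illustration, not the actual Spark code):

```scala
// A lazy val captures its value on first use and never re-evaluates,
// so per-row state must not be cached in one.
class KeyReader(row: () => String) {
  lazy val cached: String = row()    // evaluated once, then frozen
  def evalCached(): String = cached  // returns the first row forever
  def evalPerRow(): String = row()   // re-evaluates on every call
}

object LazyValPitfall extends App {
  var current = "row1"
  val reader = new KeyReader(() => current)
  println(reader.evalCached())  // row1
  current = "row2"
  println(reader.evalCached())  // still row1: the stale cache
  println(reader.evalPerRow())  // row2
}
```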

@SparkQA commented Mar 22, 2020

Test build #120162 has finished for PR 27836 at commit ce18e41.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


override def eval(input: InternalRow): Any = {
try {
lazy val json = child.eval(input).asInstanceOf[UTF8String]
Member

Why do we need lazy here?

arguments = """
Arguments:
* json_object - A JSON object is required as argument. `Null` is returned, if an invalid JSON
string is given. `Runtime Exception` is thrown, if null string or JSON array is given.
Member

How about this?

      * json_object - A JSON object. If it is an invalid string, the function returns null.
              If it is a JSON array or null, a runtime exception will be thrown.

btw, is the error handling consistent with the other JSON functions?

Contributor (Author)

I checked the other json functions. IllegalArgumentException is thrown by from_json for invalid inputs and it is better than RuntimeException.
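
A minimal sketch of the check being discussed at this point (assumed shape and message; note that, per the PR description and a diff further below, the final version returns NULL for empty input instead of throwing):

```scala
import com.fasterxml.jackson.core.JsonFactory

// Reject empty/null input at evaluation time with a runtime exception,
// rather than an AnalysisException, since the error occurs while
// evaluating rows, not while analyzing the plan.
def requireJsonObjectInput(json: String): Unit = {
  val parser = new JsonFactory().createParser(json)
  try {
    if (parser.nextToken() == null) {
      throw new IllegalArgumentException(
        "json_object_keys expects a JSON object as input, but nothing was provided.")
    }
  } finally parser.close()
}
```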


test("json_object_keys") {
val df = Seq("""{"a": 1, "b": 2, "c": 3}""".stripMargin)
.toDF("json")
Member

nit: you don't need the line break here.

@SparkQA commented Mar 23, 2020

Test build #120213 has finished for PR 27836 at commit bac0b75.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 23, 2020

Test build #120215 has finished for PR 27836 at commit de7e02b.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please

@SparkQA commented Mar 24, 2020

Test build #120264 has finished for PR 27836 at commit de7e02b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@iRakson (Contributor, Author) commented Mar 27, 2020

gentle ping @HyukjinKwon @maropu @MaxGekk

-- !query schema
struct<>
-- !query output
java.lang.IllegalArgumentException
Member

Why do you throw IllegalArgumentException for '[1, 2, 3]' but return NULL for '{"key": 45, "random_string"}'? It looks slightly inconsistent, though both inputs are invalid.

Contributor (Author)

For invalid JSON strings, NULL is returned. '[1, 2, 3]' is a valid JSON string but not a JSON object, so it is an invalid argument.

Contributor (Author)

Regarding this issue, I am thinking of implementing a function ISJSON() (sketched below). It will return NULL for null, 1 for a valid JSON string, and 0 for an invalid JSON string.

This function is supported by most of the major DBMSs.

Currently we return NULL for invalid JSON.

In this PR, as well as in #27759, NULL is returned for null JSON strings, so the behaviour might be confusing, as we return NULL for invalid JSON strings as well.
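
A sketch of the proposed semantics (isjson is hypothetical here and not part of this PR; return values follow the description above):

```scala
spark.sql("""SELECT isjson('{"a": 1}')""")   // 1: valid JSON
spark.sql("""SELECT isjson('[1, 2, 3]')""")  // 1: valid JSON (an array)
spark.sql("""SELECT isjson('{"a": 1')""")    // 0: invalid JSON
spark.sql("""SELECT isjson(NULL)""")         // NULL
```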

@SparkQA commented Apr 6, 2020

Test build #120874 has finished for PR 27836 at commit e45e61c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Hi, @iRakson. Could you resolve the conflicts?

("""{"key": 45, "random_string"}""", null),
("", null),
("[]", null),
("""[{"key": "JSON"}]""", null)
Member

Could you sort the test cases a bit more meaningfully, please?

("", null),
("[]", null),
("""[{"key": "JSON"}]""", null)
).foreach{
Member

nit. ).foreach{ -> ).foreach {

select json_object_keys(200);
select json_object_keys('');
select json_object_keys('{"key": 1}');
select json_object_keys('{}');
Member

Shall we switch lines 67 and 68?

val df = Seq("""{"a": 1, "b": 2, "c": 3}""").toDF("json")
val dfKeys = df.selectExpr("json_object_keys(json)")
checkAnswer(dfKeys, Row(Array("a", "b", "c")))
}
Member

We can revert this change from this PR since we have a superset already.

* A function which returns all the keys of the outmost JSON object.
*/
@ExpressionDescription(
usage = "_FUNC_(json_object) - returns all the keys of the outmost JSON object as an array.",
Member

nit. returns -> Returns?

parser => getJsonKeys(parser, input)
}
} catch {
case _: JsonProcessingException => null
@dongjoon-hyun (Member) commented Apr 7, 2020

Just a question: is there no need to handle IOException here?

Contributor (Author)

Yes, nextToken() throws IOException. I will update it with other changes.
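
A sketch of the broadened error handling (assumed shape; Jackson's JsonProcessingException already extends IOException, so catching IOException also covers parse errors):

```scala
import java.io.IOException
import com.fasterxml.jackson.core.{JsonFactory, JsonProcessingException}

// Map both malformed JSON and I/O failures from the parser to null.
def tryFirstToken(json: String): AnyRef =
  try {
    val parser = new JsonFactory().createParser(json)
    try parser.nextToken() finally parser.close()
  } catch {
    case _: JsonProcessingException | _: IOException => null
  }
```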

// return null if an empty string or any other valid JSON string is encountered
if (parser.nextToken() == null || parser.currentToken() != JsonToken.START_OBJECT) {
return null
}
Member

Shall we move this exception handling part (lines 849~851) out of getJsonKeys to line 839 (outside of this function)?

@iRakson (Contributor, Author) commented Apr 8, 2020

@dongjoon-hyun I resolved the conflicts and have tried to address all the review comments.
Kindly review.

@iRakson requested a review from @dongjoon-hyun on April 8, 2020 08:39
@SparkQA commented Apr 8, 2020

Test build #120963 has finished for PR 27836 at commit 7eba58c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 8, 2020

Test build #120964 has finished for PR 27836 at commit faa3571.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment

+1, LGTM. Thank you, @iRakson and all.
Merged to master for Apache Spark 3.1.0.

sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020

Closes apache#27836 from iRakson/jsonKeys.

Authored-by: iRakson <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@svmgarg commented May 31, 2023

What's the benefit of having the function in public API?

We don't need to define the schema with json_object_keys.
